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Abstract. Traditionally, the MaxEnt workshops start by a tutorial day. This paper summarizes my 
talk during 2001'th workshop at John Hopkins University. The main idea in this talk is to show how 
the Bayesian inference can naturally give us all the necessary tools we need to solve real inverse 
problems: starting by simple inversion where we assume to know exactly the forward model and 
all the input model parameters up to more realistic advanced problems of myopic or blind inversion 
where we may be uncertain about the forward model and we may have noisy data. 

Starting by an introduction to inverse problems through a few examples and explaining their 
ill posedness nature, I briefly presented the main classical deterministic methods such as data 
matching and classical regularization methods to show their limitations. I then presented the main 
classical probabilistic methods based on likelihood, information theory and maximum entropy and 
the Bayesian inference framework for such problems. I show that the Bayesian framework, not 
only generalizes all these methods, but also gives us natural tools, for example, for inferring the 
uncertainty of the computed solutions, for the estimation of the hyperparameters or for handling 
myopic or blind inversion problems. Finally, through a deconvolution problem example, I presented 
a few state of the art methods based on Bayesian inference particularly designed for some of the 
mass spectrometry data processing problems. 



INTRODUCTION 
Forward and inverse problems 

In experimental science, it is hard to find an example where we can measure directly 
a desired quantity. Describing mathematical models to relate the measured quantities to 
the unknown quantity of interest is called forward modeling problem. The main object 
of a forward modeling is to be able to generate data which are as likely as possible to the 
observed data if the unknown quantity was known. But, almost always, we want to use 
this model and the observed data to make inference on the unknown quantity of interest: 
This is the inversion problem. To be more explicit, let take an example that we will use 
all along this paper to illustrate the different aspects of inverse problems. The example 
is taken from the mass spectrometry where the ideal physical quantity of interest is the 
components mass distribution of the material under the test. There are many techniques 
used in mass spectrometry. The Time-of-Flight (TOF) technique is one of them. In this 
technique, one measures the electrical current generated on the surface of a detector by 
the charged ions generated by the material under the test. Finding a very fine physical 
model to relate the time variation of this current to the distribution of the arrival times 
of the charged ions, which is itself related to the components mass distribution of the 
material under the test, is not an easy task. However, in a first approximation, assuming 



that the instrument is linear and its characteristics do not change during the acquisition 
time of the experiment, a very simple convolution model relates the raw data g(t) to the 
unknown quantity of interest f(t): 



9(r) = J f(t)h{r-t) dt, 



(1) 



where h(t) is the point spread function (psf) of the instrument. Figure 1 shows an 
example of data observed (signal in b) for a theoretical mass distribution (signal in a). 




FIGURE 1. Blurring effect in TOF mass spectrometry data: a) desired or theoretical spectrum, b) 
observed data. 

In this example, the forward problem consists in computing g given / and h which is 
given by a simple convolution operation. The inverse problem of inferring / given g and 
h is called deconvolution, the inverse problem of inferring h given g and / is called psf 
identification and the inverse problem of inferring h and / given only g is called blind 
deconvolution. 

In my talk, I have given many more examples such as image restoration 



g{x',y') 



J J f(x,y)h(x'-x,y'-y) dx dy, 



(2) 



or Fourier synthesis inversion 



g{r) = J /(o>) exp{-jur} du 



(3) 



as well as a few non linear inverse problems. I am not going to detail them here, but I 
try to give a unified method to deal with all these problems. For this purpose, first we 
note that, in all these problems, we have always limited the number of data, for example 



Hi — g( T i)y i = 1,- ■ ■ ,m. We also note that, to be able to do numerical computation, we 
need to model the unknown function / by a finite number of parameters 
As an example, we may assume that 

n 

/(0 = 5>A(0 (4) 

where bj(t) are known basis functions. With this assumption the raw data 
y = [y 1: . . . , y m ] are related to the unknown parameters x by 

n . 

Vi = g{n) = ^2 H i,j%j with H itj = / bj(t)h(t-T^) dt (5) 
j=i J 

which can be written in the simple matrix form y = Hx. The inversion problem can 
then be simplified to the estimation of x given H and y. Two approaches are then in 
competition: 

i) the dimensional control approach which consists in an appropriate choice of the basis 
functions bj (r) and n < m in such a way that the equation y = Ax be well conditioned; 

ii) the more general regularization approach where a classical sampling basis for bj(r) 
with desired resolution is chosen no matter if n > m or if A is ill conditioned. In 
the following, we follow the second approach which is more flexible for adding more 
general prior information on x. 

We must also remark that, in general, it is very difficult to give a very fine mathe- 
matical model to take account for all the different quantities affecting the measurement 
process. However, we can almost always come up with a more general relation such as 

yi = hQ(x) + 6i, i = l,...,m (6) 

where represents the unknown parameters of the forward model (for example the 
amplitude and the width of a Gaussian shape psf in a deconvolution problem) and 
e = [ex,... ,e m ] represents all the errors (measurement noise, discretization errors and 
all the other uncertainties of the model). For the case of linear models we have 

y = H e x + e. (7) 

In this paper we focus on this general problem. We first consider the case where the 
model is assumed to be perfectly known. This is the simple inversion problem. Then we 
consider the more general case where we have also to infer on 6. This is the myopic or 
blind inversion problem. 

Even in the simplest case of perfectly known linear system and exact data: 

i) the operator H may not be invertible (H^ 1 does not exist); 

ii) it may admit more than one inverse (3Gi and G 2 \Gi(H) = G 2 (H) = I where i" is 
the identity operator); or 

iii) it may be very ill-posed or ill-conditioned (meaning that there exists x and x + ctbx 
for which \\H^ 1 (x) — H^ 1 (x + a5x)\\ never vanishes even if a i— > [1, 2]. 

These are the three necessary conditions of existence, uniqueness and stability of 
Hadamard for the well-posedness of an inversion problem. This explains the fact that, 



in general, even in this simple case, many naive methods based on generalized inversion 
or on least squares may not give satisfactory results. The following figure shows, in a 
simple way, the ill-posedness of a deconvolution problem. On this figure, we see that 
three different input signals can result three outputs which are practically indistinguish- 
able from each other. This means that, data matching alone can not distinguish between 
any of these inputs. 
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FIGURE 2. 

able outputs. 



Ill-posedness of a deconvolution problem: Inputs on the left give practically indistinguish- 



As a conclusion, we see that, apart from the data, we need extra information. The art of 
inversion in a particular inverse problem is how to include just enough prior information 
to obtain a satisfactory result. In the following, first we summarize the classical deter- 
ministic approaches of data matching and regularization. Then, we focus on probabilistic 
approaches where errors and uncertainties are taken into account through the probability 
laws. Here, we distinguish, three classes of methods: those which only account for the 
data errors (error probability distribution matching and likelihood based methods), those 
which only account for uncertainties of unknown parameters (entropy based methods) 
and those which account for both of them (Bayesian inference approach). 



DATA MATCHING AND REGULARIZATION METHODS 

Exact data matching 



Let consider the discretized equation y,i = hi(x) + i — 1, . . . ,m; and assume first 
that the model and data are exact (q = 0). We can then write y = h(x). 



Assume now the system of equations is under determined, i.e., there is more than one 
solution satisfying it (for example when the number of data is less than the number of 
unknowns). Then, one way to obtain a unique solution is to define an a priori criterion, 
for example A (a;, m) to choose that unique solution by 

x = argmin{A(a3,m)} (8) 
h(x)=y 

where m is an a priori solution and A a distance measure. 

In the linear inverse problems case, the solution to this constrained optimization 
can be obtained via Lagrangian techniques which consists in defining the Lagrangian 
C{x, X) = A(x, m) + X t (y — Hx) and searching for (A, x) through 

= argmin^{"D(A) = inl x C(x, A)} 
= argmin a ,|£(a;,A)| ^ 

Noting that Va;£ = VxA(x ) m) — H l X and V \C = y — Hx and defining Q(s,m) = 
sup x {x t s — A(x, m)} the algorithm to find the solution x becomes: 

- Determine Q(s,m) = snp x {x t s — A(x,m)}; 

- Find A = arg min A {£>( A) = X l y - g{H l X, m) } ; 

- Determine x = V s ^(Jf*A). 

As an example, when A(x,m) = \ \\x — m\\ 2 then Q(s,m) = m t s + \ ||s|| 2 , V sQ = 

m + s and V(X) = X l y - m'H'X + \ \H l xf which results to A = (HH t )- 1 (y - 
Hm) and the solution is given by 

x = m + H\HH t )-\y-Hm). (10) 

One can remark that, when m = we have x = H t (HH t )~ 1 y and this is the classical 
minimum norm generalized inverse solution. 

Another example is the classical Maximum Entropy method case where A(x,m) = 
KL(x, m) is the Kullback-Leibler distance or cross entropy between x and the a priori 
solution m: 

KL(x,m) = ^rrjln— — (xj — rrij) (11) 

3 3 

Here, the solution is given by 

Xj — rrij exp -[A* A]j with A = argmin{P(A) = X t y-Q(A t \,m)} (12) 
L J A 

where Q(s, m) = ^) ■ rrij (1 — exp [— Sj]). But, unfortunately here V(X) is not a quadratic 

function of A and thus there is not an analytic expression for A. However, it can 
be computed numerically and many algorithms have been proposed for its efficient 
computation. See for example [3] and the cited references for more discussions on the 
computational issues and algorithm implementation. 



The main issue here is that, this approach gives a satisfactory solution to the unique- 
ness of the inverse problem, but in general, the performances obtained by the resulting 
algorithms stay sensitive to error on the data. 



Least squares data matching and regularization 

When the discretized equation y = h(x) is over-determined, i.e., there is no solution 
satisfying it exactly (for example when the number of data is greater than the number of 
unknowns or when the data are not exact), one can try to estimate them by: 

x = axgmm{A(y,h(x))} , (13) 
x 

where A(y, h(x)) is a distance measure in the data space. The case where A(y , h(x)) = 
\\y — h(x) || 2 is the classical Least Squares (LS) criterion. 

For a linear inversion problem y = Hx, it is easy to see that any x which satisfies 
the normal equation H l Hx = H l y is a LS solution. If H l H is invertible and well- 
conditioned then x = [H^H^^y is again the unique generalized inverse solution. 
But, in general, this is not the case: H l H is rank deficient and we need to constrain the 
space of the admissible solutions. The constraint LS is then defined as 

x = argmin { \\y — Hx || 2 } . (14) 
xeC 

where C is a convex set. The choice of the set C is primordial to satisfy the three 
conditions of a well-posed solution. An example is the positivity constraint: C = {x : 
Vj, Xj > } . Another example isC = {x: \\x\\ <a} where the solution can be computed 
via the optimization of 

J(x) = \\y-H(x)\\ 2 + \\\x\\. (15) 

The main technical difficulty is the relation between a and A. The minimum norm LS 
solution can also be computed using the singular value decomposition [4]. The main 
issue here is that, even if this approach has been well understood and commonly used, 
it assumes implicitly that the noise and the x are Gaussian. This may not be suitable in 
some applications, and more specifically in mass spectrometry data processing where 
the unknowns are spiky spectra. 

A more general regularization procedure is to define the solution to the inversion 
problem y = H(x) + e as the optimizer of a compound criterion J(x) = \\y — Hx\\ 2 + 
\(j)(x) or the more general criterion 

J(x) = A 1 (y,Hx) + XA 2 (x,m). (16) 

where A 1 and A 2 are two distances or discrepancy measures, A a regularization param- 
eter and m an a priori solution[5]. The main questions here are: i) how to choose Ai 
and A 2 and ii) how to determine A and m. 



For the first question, many choices exist: 

- Quadratic or L 2 distance: A(x,z) = \\x — z\\ 2 = J2j( x j ~~ z j) 2 > 

- L p distance: A(x, z) = \\x — z\\ p = . \ x j — Zj\ p ; 

- Kullback distance: A(x, z) = Yl<j x j ^ n ( x j/ z j) ~ ( x j ~ z j)' 

- Roughness distance: A(x,z) any of the previous distances with Zj = Xj-i or Zj = 
(xj-i +Xj + i)/2 or any linear function Zj = ip(x k: k G where N(j) stands for 
the neighborhood of j. (One can see the link between this last case and the Gibbsian 
energies in the Markovian modeling of signals and images.) 

The second difficulty in this deterministic approach is the determination of the reg- 
ularization parameter A. Even if there are some techniques based on cross validation 
[6, 7, 8], there is not natural tools for their extension to other hyperparameters in a nat- 
ural way. 

As a simple example, we consider the case where both A 1 and A 2 are quadratic: 
J(x) = \\y — Hx\\yy + A \\x — m\\ 2 Q. The optimization problem, in this case, has an 
analytic solution: 

x = (H'WH + XQy 1 (H'Wy-Qm)) (17) 

which can also be written 

x = m + Q _1 Jf* (HQ l H l + X^W' 1 ) ^ (y - Hm) (18) 

which is a linear function of the a priori solution m and the data y. Note also 
that when m = 0, Q = I and W = I we have x = (H l H + A/) 1 H l y or x = 
H* (HH* + A -1 J) y and when A = we obtain the generalized inverse solutions 

x = (H l H) ^ H f y orx = H t (HH l ) ^ y. 

As we mentioned before, the main practical difficulties in this approach are the choice 
of Ai and A 2 and the determination of the hyperparameters A and the inverse covariance 
matrices W and Q. 

As a main conclusion on these deterministic inversion methods, we can say that, 
even if, in practice, they are used and give satisfaction, they lack tools to handle with 
uncertainties and to account for more precise a priori knowledge of statistical properties 
of errors and unknown parameters. The probabilistic methods can exactly handle more 
easily these problems as we will see in the following. 



PROBABILISTIC METHODS 
Probability distribution matching and maximum likelihood 

The main idea here is to account for data and model uncertainty through the as- 
signment of a theoretical distribution p Y \x(y\ x ) to the data. In probability distribution 
matching method, the main idea is to determine the unknown parameters x by minimiz- 
ing a distance measure A(p,p) between the empirical histogram p of the data defined 



as 



i 

and the theoretical distribution of the data p Y \ x {z\x). 

When A(p,p) is choosed to be the Kullback-Leibler mismatch measure 

A f p(z) 
KL[p,p] = p(z)\n P \\ dz 
J py\ X (z\x) 



= - J P( z ) m Vy\x (z I x) dz + J p(z)\np(z) dz 



(20) 



we have 



x = argmin {KL [p,p]} = argmin < — / p(z) \a.p Y \x{z\x) dz > . (21) 
x x { J J 

It is then easy to see that, for the i.i.d. data, this estimate becomes equivalent to the 
maximum likelihood (ML) estimate 

x = argmin {-\np Y \x(z\x)\ z=y } = argm£Lx{p Y \ x (y\x)} . (22) 

X X 

In the case of a linear model and Gaussian noise, it is easy to show that the ML estimate 
becomes equivalent to the LS one, which in general, does not give satisfactory results as 
we have discussed it in the previous section. 

The important point to note here is that, in this approach, only the data uncertainty 
is considered and modeled through the probabilty law p Y \ x (y\x). We will see in the 
following that, in contrary to this approach, in information theory and maximum entropy 
methods, the data and model are assumed to be exact and only the uncertainty of 
x is modeled through an a priori reference measure fi(x) which is updated to an a 
posteriori probabilty law p(x) by optimizing the KL mismatch KL(p,p) subject to the 
data constraints. 



Maximum entropy in the mean 

The main idea in this approach is to consider x as the mean value of a quantity X EC, 
where C is a compact set on which we want to define a probability law P: x = E P {X} 
and the data y as exact equality constraints on it: 



y = Hx = HE P {X} = 




Then, assuming that we can translate our prior information on the unknowns through a 
prior law (a reference measure) dp(x), we can determine the distribution P by: 

r dP(x) 

maximize - / In - ) ( dP(x) s.t. y = Hx = HE P {X} . (24) 

Jc M x ) 



The solution is obtained via the Lagrangian: 



£(*,A)=/ 



ln dP(x)_ A 

dfj,(x) 



dP(x) 



and is given by: dP(x, A) = exp [A* [H x] — In Z( A)] dfi(x), where 

Z(X) = / exp [A* [if a;]] dfi(x). The Lagrange parameters are obtained by searching 
J 0, 

the unique solution (if exists) of the following system of non linear equations: 

d -^ = V., i = l,-,M. (25) 
oXi 

Then, naturally, the solution to the inverse problem is defined as the expected value of 
this distribution: x(X) = E P {X} — fx dP(x, A). The interesting point here is that, 

the solution x(X) can be computed without actually computing P in two ways: 

- Via optimization of a dual criterion: The solution x is expressed as a function of the 
dual variable s = H l X by x(s) = VsG(s, m) where 

G(s,m) = In Z(s,m) = In J exp [s*cc] dfj,(x), m = E fl {X} = J x d/i(x) 

and \ = argmax{D(\)=\ t y-G(H t \)}. 
A 

- Via optimization of a primal or direct criterion: 

x = argmm{H(x,m)} s.t. y = Hx where H(x,m) = snp{s t x — G(s,m)}. 
xeC s 

Another interesting point is the link between these two options: 

i) Functions G and H depend on the reference measure n(x); 

ii) The dual criterion D{\) depends on the data and the function G; 

iii) The primal criterion H(x, m) is a distance measure between x and m which means: 
H(x, m) > and H(x, m) = iff x = m; H(x, m) is differentiable and convex 
on C and H(x,m) = oo \ix^C\ 

iv) If the reference measure is separable: /i(x) = Ylf =1 [J>j(xj) then P is too: 
dP(x, A) = n^Li dPj(xj,\) and we have 

G(s,m) = 53ft-(sj,mj), H(x,m) = ^hjix^rrij), Xj = cf j {sj,m j ). 

3 3 

where gj is the log Laplace transform (Cramer transform) of nf. 

gj(s) = In J exp [sx] dfij(x); 
and /ij is the convex conjugate of gf hj(x) = max{sx — gj(s)}. 



The following table gives three examples of choices for /ij and the resulting expres- 
sions for Qj and hf 







9j( s ) 


hj(x,m) 


Gaussian: 


cxp — (1/2) (x — m) 2 


(1/2) (s-mf 


(l/2)(x-m)' z 


Poisson: 


(m x /x\) exp [— m\ 


exp [m — s\ 


—x In (x/m) + m — x 


Gamma: 


x a ~ L exp [— (x/m)] 


ln(s — m) 


— In (x/m) + (x/m) — 1 



We may remark that the two famous expressions of the Burg \nx and Shannon — a; In a; 
entropies are obtained as special cases. 

As a conclusion, we see that the Maximum entropy in mean extends in some way the 
classical ME approach by giving other expressions for the criterion to optimize. Indeed, 
it can be shown that when we optimize a convex criterion subject to the data constraints 
we are optimizing the entropy of some quantity related to the unknowns and vise versa. 
However, as we have mentioned, basically, in this approach the data and the model are 
assumed to be exact even if some extensions to the approach gives the possibility to 
account for the errors [9]. In the next section, we see how the Bayesian approach can 
naturally account for both uncertainties on the data and on the unknown parameters x. 



BAYESIAN INFERENCE APPROACH 

In Bayesian approach, the main idea is to translate our prior knowledge on the errors 
e and on the unknowns x to prior probability laws p(e) and p(x). The next step is to 
use the forward model and p(e) to deduce p(y\x). The Bayes rule can then be used to 
determine the posterior law of the unknowns p(x\y) from which we can deduce any 
information about the unknowns x. The posterior p(x\y) is thus the final product of the 
Bayesian approach. However, very often, we need a last step which is to take out the 
necessary information about x from this posterior. The tools for this last step are the 
decision and estimation theories. 

To illustrate this, let consider the case of linear inverse problems y = Hx + e. The 
first step is to write down explicitly our hypothesis: starting by the hypothesis that e is 
zero-mean (no systematic error), white (no time correlation for the errors) and assuming 
that we may only have some idea about its energy a\ = 1/(20!), and using either 
the intuition or the Maximum Entropy Principle (MEP) lead to a Gaussian prior law: 
e ~ A/*(0, 1/(20!)/). Then, using the forward model with this assumption leads to: 

p(y\x,(f) 1 ) ocexp [-(f>i\\y - Hx\\ 2 ] . (26) 

The next step is to assign a prior law to the unknowns x. This step is more difficult and 
needs more caution. In inverse problems, as we presented, x represents the samples of a 
signal or the pixel values of an aerian image. Very often then we have ensemblist prior 
knowledge about the signals or images concerned by the application and we can model 
them. The art of the engineer is then to choose the appropriate model and to translate 
this information to a probability law to reflect it. 

Again here, let illustrate this step, first through a few general examples and then more 
specifically the case of mass spectrometry deconvolution problem. 



In the first example, we assume that, a priori we do not have (or we do not want 
or we are not able to account for) any knowledge about the correlation between the 
components of x. This leads us to 

p(x) Hl'.i-rj)- (27) 

3 

Now, we have to assign Pj(xj). For this, we may assume to know the mean values m 3 - 
and some idea about the dispersions about these mean values. This again leads us to 
Gaussian laws M{rrij,a1), and if we assume the same dispersions a^. — l/ (20 2 ), Vj we 
obtain 

= exp [— 02 — m\\ 2 ] (28) 

With these assumptions, using the Bayes rule, we obtain 

p(x\y) oc exp \—4>i \\y — Hx\\ 2 — 2 ||a; — m|| 2 ] . (29) 

This posterior law contains all the information we can have on x (combination of our 
prior knowledge and data). If x was a scalar or a vector of only two components, we 
could plot the probability distribution and look at it. But, in practical applications, x may 
be a vector with huge number of components. Then, even if we can obtain an expression 
for this posterior, we may need to summarize its information content. In general then, we 
may choose, equivalently, between summarizing it by its mode, mean, marginal modes, 
etc ... , or use the decision and estimation theory to define point estimators to be used 
to compute (best representing values). For example, we can choose the value x which 
corresponds to the mode of p(x\y)- the Maximum a posteriori (MAP) estimate, or the 
value x which corresponds to the mean of this posterior- the Posterior mean (PM) 
estimate, or when interested to the component Xj, to choose Xj corresponding to the 
mode of the posterior marginal p(xj\y). 

We can also generate samples from this posterior and just look at them as a movie 
or use them to compute the PM estimate. We can also use it to compute the posterior 
covariance matrix (P = E{(x — x)(x — x) 1 } where x is the posterior mean), from 
which we can infer on the uncertainty of the proposed solutions. 

In the Gaussian priors case we just presented, it is easy to see that, the posterior law is 
also Gaussian and all these estimates are the same and can be computed by minimizing 

J(x) = — \np(x\y) = \\y — Hx\\ 2 + A \\x — m\\ 2 with A = ^ = (30) 

We may note here the analogy with the quadratic reg^ularization criterion (16) with the 
emphasis that the choice Ai(y,Hx) = \\y — Hx\\ and A 2 (x,m) = \\x — m\\ are 
the direct consequences of Gaussian choices for prior laws of the noise p(e) and the 
unknowns p{x). 

The Gaussian choice for Pj(xj) is not always a pertinent one. For example, we may 
a priori know that the distribution of Xj around their means rrij are more concentrated 



p(x) oc exp 



-<fa^2\xj- 



rrij 



but great deviations from them are also more likely than a Gaussian distribution. This 
knowledge can be translated by choosing a Generalized Gaussian law: 



p(xj) oc exp 



2al 



\Xj 7Jlj\ 



l<p<2. (31) 



In some cases we may know more, for example we may know that Xj are positive 
values. Then a Gamma prior law 

p(xj) = Q(a,m,j) oc (xj/mj)~ a exp [—Xj/m,j] (32) 

would be a better choice. 

In some other cases we may know that xj are discrete positive values. Then a Poisson 
prior law 

777, 

p(xj) oc — y exp [— rrij] (33) 

Xj. 

is a better choice. 

In all these cases, the expression of the posterior is p(x\y) oc exp [— J(x)] with J (a;) = 
\\y — Hx\\ 2 + \<f>(x) where <f>(x) = — lnp(x). It is interesting to note the different 
expressions of 4>(x) for the prior laws discussed and remark that they contain different 
entropy expressions for the x. 

The last general example is the case where a priori we know that Xj are not indepen- 
dent, for example when they represents the pixels of an aerian image. We may then use 
a Markovian modeling 

p(xj\x k ,ke S) =p(xj\x k ,k eJ\f(j)), (34) 

where S = {1, . . . ,N} stands for the whole set of pixels and Af(j) = [k:\k- j\ <r} 
stands for r-th order neighborhood of j. 

With some assumptions on the border limits, such models again result to the opti- 
mization of the same criterion with 

cf)(x) = A 2 (x,z) = ^^(f)(xj,Zj) where Zj = ip(x k ,k G Af(j)) (35) 
j 

with different potential functions (f)(xj,Zj). 

A simple example is the case where Zj = Xj-i and (j)(xj,Zj) any function in between 
the following: 

\xj — Zj\ a , a: In — H — -, Xj\n — + (xj — Zj) 

Zj Zj Zj 

See ([10, 11, 12, 13]) for some more discussion and properties of these potential func- 
tions. 

As one of the main conclusions here, we see that, as it concerns the MAP estimation, 
the Bayesian approach is equivalent to the general regularization. However, here the 



choice of the distance measure Ai(y,Hx) depends on the forward model and the 
hypothesis on the noise and the choice of the distance measure A 2 (x,m) depends on 
the prior law chosen for x. 

One more extra feature here is that, we have access to the whole posterior p(x | y) from 
which, not only we can define an estimate but also, we can quantify its corresponding 
remained uncertainty. We can also compare posterior and prior laws of the unknowns to 
measure the amount of information contained in the observed data. Finally, as we will 
see in the following, we have finer tools to model unknown signals or images and to 
estimate the hyperparameters. 



OPEN PROBLEMS AND ADVANCED METHODS 

As we have remarked in previous sections, in general, the solution of an inverse problem 
depends on our prior hypothesis on errors e and on x. Before applying the Bayes rule, 
we have to assign the prior laws to them. From the forward model and assumptions 
on e we assign p(y\x,(p 1 ) and from the assumptions on x we assign p{x\(p 2 ). This 
step is one of the most crucial part of the applicability of the Bayesian framework for 
inverse problems. Modeling a signal and finding the corresponding expression for the 
prior law p(x\<j> 2 ) is not an easy task. This choice may have many consequences: the 
complexity of the computation of the posterior and consequently the computation of 
any point estimators such as MAP (which needs optimization) or PM (which needs 
integration either analytically or by Monte Carlo methods). This modeling depends also 
on the application. We discuss this point through the particular deconvolution problem 
in mass spectrometry. 



Appropriate modeling of input signal 

We actually had started this discussion in previous section and we saw that, at least 
for linear inverse problems with a white Gaussian assumption of the noise, the posterior 
has for expression: p(x\y) oc exp [— J(x)\ with 

J(x) = \\y - Hxf + \<j>{x) (36) 

with 4>(x) = — \np(x). Thus the expression and properties of J(x), and consequently 
those of the posterior p(x\y) depend on the prior p(x). For example if p(x) is Gaussian 
then 

4>(x) = — \np(x) = y~]a;j 

3 

is a quadratic function of x. Then the MAP or PM estimates have the same values and 
their computation needs the optimization of a quadratic criterion which can be done 
either analytically or by using any simple gradient based algorithm. But the Gaussian 
modeling is not always an appropriate one. Let take our example of deconvolution of 



mass spectrometry data. We know a priori that the input signal must be positive. Then a 
truncated Gaussian will be a better choice: 



(j>(x) — if Xj > 0; else 0(a?) = oo. 

But, we know still more about the input signal: it has pulse shapes, meaning that, if we 
look at the histogram of the samples of a typical signal, we see that great number of 
samples are near to zero but great deviations from this background are not rare. Thus, a 
generalized Gaussian 

4>(x) = ^ \xj\ p with 1 < p < 2; if x 3 > 0; else (j>(x) = oo. 
j 

or a Gamma prior law 

(f>(x) = y^lnxj + Xj if Xj > 0;else (f>(x) = oo. 
j 

would be better choices. 

We can also go further in details and want to account for the fact that we are looking 
for atomic pulses. Then we can imagine a binary valued random vector z with p(zj = 
I) — a and p(zj = 0) = 1 — a, and describe the distribution of x hierarchically: 

p(xj\z j ) = z j po(xj) (37) 

with po(xj) being either a Gaussian p(xj) = Af(m, a 2 ) or a Gamma law p(xj) = Q (a, b). 
The second choice is more appropriate while the first results on simpler estimation 
algorithms. The inference can then be done through the joint posterior 

p(x,z\y) (xp(y\x)p(x\z)p(z) (38) 

The estimation of z is then called Detection and that of x Estimation. The case where 
we assume p(z) = Y\.jP{ z j) — ® ni (l — a) (jl ~ ni ^ with n\ the number of ones and n the 
length of the vector z, is called Bernoulli process and this modelization for x is called 
Bernoulli-Gaussian or Bernoulli-Gamma as a function of the choice for p (xj). 

The difficult step in this modeling is the detection step which needs the computation 

of 

p(z\y)(xp{z) J p(y\x)p(x\z)dx (39) 

and then its optimization over {0, l} n where n is the length of the vector z. The cost of 
the computation of the exact solution is huge (a combinatorial problem). 

Many approximations to this optimization have been proposed which result to differ- 
ent algorithms for this detection-estimation problem [14]. Many Monte Carlo techniques 
have also been proposed for generating samples of z and x from the posterior and thus 
compute the PM estimates of x. Giving more details on this modeling and details of 
corresponding algorithms is out of the scope of this paper. 



The results on the following figure illustrate this discussion. Here, we used the data 
in figure 1 and computed x by optimizing the MAP criterion (36), with different prior 
laws p(x) oc exp [— X(/)(x)] in between the following choices: 

a) Gaussian: <j)(x) = 

b) Gaussian truncated on positive axis: 4>(x) = J2 X % x j > 0' 

c) Generalized Gaussian truncated on positive axis: <fi(x) = Yl\ x j \ p w i m P — !•!> x j > 0- 

d) Entropic prior (j)(x) — ^Xjlaxj — Xj, Xj > 0, 

e) Gamma prior: <j>(x) = J^liiXj + Xj, Xj > 0. 



FIGURE 3. Deconvolution results with different priors: a) Gaussian b) Gaussian truncated to positive 
axis c) Generalized Gaussian, d) — x\nx entropic prior e) In a: entropic or Gamma prior. 



As it can be seen from these results 1 , for this application, the Gaussian prior does not 
give satisfactory result, but in almost all the other cases the results are more satisfactory, 
because the corresponding priors are more in agreement with the nature of the unknown 
input signal. 
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FIGURE 4. Plots of the different prior laws p(x) oc exp [— \<f>(x)]: a) Truncated Gaussian tfi(x) = 
x 2 , A = 3 b) Truncated generalized Gaussian <j)(x) = x p , p = 1.1, A = 4; c) Entropic <f>(x) = x\nx — 
x, A = 10 d) Entropic <f)(x) = \nx + x,X = 0.1. 

The Gaussian prior (a) is not at all appropriate, Gaussian truncated to positive axis (b) 
is a better choice. The generalized Gaussian truncated to positive axis (c) and the — xlnx 
entropic priors (d) give also almost the same results than the truncated Gaussian case. 
The Gamma prior (e) seems to give slightly better result (less missing and less artifacts) 
than all the others. This can be explained if we compare the shape of all these priors 
shown in figure (4). The Gamma prior is sharper near to zero and has longer tail than 
other priors. It thus favorites signals with greater number of samples near to zero and still 
leaves the possibility to have very high amplitude pulses. However, we must be careful 
on this interpretation, because all these results depend also on the hyperparameter A 
whose value may be critical for this conclusion. In these experiments we used the same 
value for all cases. This brings us to the next open problem which is the determination 
of the hyperparameters. 



Remark that the results are presented on a logarithmic scale for the amplitudes to show in more detail 
the low amplitude pulses. We used log(l + a) scale in place of y scale which has the advantage of being 
equal to zero for a=0. 



Hyperparameter estimation 



The Bayesian approach can be exactly applied when the direct (prior) probability laws 
p(y\x, and p(x\<f> 2 ) are assigned. Even, when we have chosen appropriate laws, still 
we have to determine their parameters (p = [4> 1 ,4> 2 \. This problem has been addressed 
by many authors and the subject is an active area in statistics. See [15, 16, 17, 6], 
[18, 19, 20, 21, 22] and also [8, 23, 24]. 

The Bayesian approach gives natural tools to handle this problem by considering 
4> = (4> 1: 4> 2 ) as extra unknown parameters to infer on. We may then assign a prior law 
p{(p) to them too. However, the way to do this is also still an open problem. We do not 
discuss it more in this paper. The readers are invited to see [25] for some extended 
discussions and references. When this step is done, we can again use the Bayesian 
approach and compute the joint posterior p(x,<p\y) from which we can follow three 
main directions: 

- Joint MAP optimization: In this approach one tries to estimate both the hyperparame- 
ters and the unknown variables x directly from the data by defining: 

(x, 4>) = argmax{p(a3, <p\yj) wherep(cc, <p\y) cxp(y\x,4>)p(x\(p)p((f>) (40) 



and where p(<p) is an appropriate prior law for <p. Many authors used the non informative 
prior law for them. 

- Marginalization: The main idea in this approach is to distinguish between the two sets 
of unknowns: a high dimensional vector x representing in general a physical quantity 
and a low dimensional vector <p representing the parameters of its prior probability laws. 
This argument leads to estimate first the hyperparameters by marginalizing over the 
unknown variables x: 



Note also that when p(<p) is choosed to be uniform, then p(<p\y) oc p(y\4>) which is the 
likelihood of the hyperparameters <fi an d the corresponding maximum likelihood (ML) 
estimate has all the good asymptotic properties which may not be the case for the joint 
MAP estimation. However, for practical applications with finite data we may not care 
too much about the asymptotic properties of these estimates. 

- Nuisance parameters: In this approach the hyperparameters are considered as the 
nuisance parameters, so integrated out of p(x, <p\y) to obtain p(x\y) and x is estimated 



{X,<f>) 




(41) 



and then, using them in the estimation of the unknown variables x: 




(42) 



by 




(43) 



- Joint Posterior Mean: Here, x and </> are estimated as the posterior means: 



x = E{x\y} = J xp(x\y) dx and </> = E{(p\y} = J 4>p{<fi\y) d(f>. (44) 

The main issue here is that, excepted the first approach, all the others need integrations 
for which, in general, there is not analytical expressions and their numerical computation 
cost may be very high. At the other hand, unfortunately, the estimation by the joint 
maximization has not the good asymptotic properties (when number of data goes to 
infinity) of the estimators obtained through the marginalization or expectation. However, 
in finite number of data, a comparison of their relative properties is still to be done. To 
see some more discussions and different possible implementations of these approaches 
see [24]. We have also to mention that, we can always use the Markov Chain Monte 
Carlo (MCMC) techniques to generate samples from the joint posterior p(x, <p\y) and 
then compute the joint posterior means and corresponding variances. It seems that these 
techniques are growing up. However, I see two main limitations for their application on 
real data: their huge computational cost and the need for some discussions on the tools 
to control their convergences. 



Myopic or blind inversion problems 

Consider the deconvolution problems (1) or (2) and assume now that the psf h(t) or 
h(x,y) are partially known. For example, we know they have Gaussian shape, but the 
amplitude a and the width a of the Gaussian are unknown. Noting by 6 = (a, a) the 
problem then becomes the estimation of both x and from y = Hqx + e. The case 
where we know only the support of the psf but not its shape can also be casted in the 
same way with 6 = [h(0), . . . , h(p)} 

Before going more in details, we must note that, in general, the blind inversion prob- 
lems are much harder than the simple inversion. Taking the deconvolution problem, we 
have seen in introduction that, the problem even when the psf is given is ill-posed. The 
blind deconvolution then is still more ill-posed, because here there are more fundamental 
under-determinations. For example, it is easy to see that, we can find an infinite num- 
ber of pairs (h,x) which result to the same convolution product h*x. This means that, 
to find satisfactory methods and algorithms for these problems need much more prior 
knowledge both on x and on h, and in general, the inputs must have more structures (be 
rich in information content) to be able to obtain satisfactory results. 

Conceptually however, the problem is identical to the estimation of hyperparameters 
in previous section and any of the four approaches presented there can be used. One 
may wish however to distinguish between these parameters of the system = (a, a) and 
those hyperparameters of the prior law model descriptions </> = (of, of, . . . ). In that case, 
one can try to write down p(x, 0, <p\y) and use one of the following: 

- Joint MAP estimation of x, 6 and 4>\ (x,0,<pi) = argmax{p(x,6,(fi\y)}. 

- Marginalize over x and estimate 6 and </> using: (9,<p) = argmax{p(0, <j)\y)} and 



then, estimate x using: x = argmax^ <^p(x\y,0,4>)j. 

- Marginalize over x and and estimate <f> using: </> = argmax{p(0|y)} „ 

then estimate using: = argmax^ and finally, estimate x using: 

x = argmax^, {p(x\y,0, $)}. 

- Joint Posterior Mean: Here, x, and <p are estimated through their respective posterior 
means: x = E{x\y}, = E{0\y} and <p = E{<p\y}. 

Here again, the joint optimization stays the simpler but we must be careful on in- 
terpretation of the results. For others, one can either use the Expectation-Maximization 
(EM) algorithms and/or MCMC sampling tools to approximately compute the necessary 
integration or expectation computations and overcome the computational cost issues. 



CONCLUSIONS 

In this paper I presented a synthetic overview of methods for inversion problems starting 
by deterministic data matching and regularization methods followed by a general pre- 
sentation of the probabilistic methods such as error probability law matching and like- 
lihood based and the information theory and maximum entropy based methods. Then, 
I focused on the Bayesian inference. I show that, as it concerns the maximum a pos- 
teriori estimation method, one can see easily the link with regularization methods. We 
discussed however the superiority of the Bayesian framework which gives naturally the 
necessary tools for inferring the uncertainty of the computed solution, for the estimation 
of the hyperparameters or for handling myopic and blind inversion problems. We saw 
also that probabilistic modeling of signal and images is more flexible for introduction of 
practical prior knowledge about them. Finally, we illustrated some of these discussions 
through a deconvolution example in mass spectrometry data processing. 
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